'Kay. Gosh. 'Kay. Is there much more in it than he d Is there much more in it than he said yesterday? Mm. Hmm. Hmm? Yeah, now I'd say if for the prototype if we just like wherever possible p chunk in the stuff that we have um pre-annotated and stuff, and for the stuff that we don't have pre-annotated write like a stupid baseline, then we should probably be able to basically that means we focus on on the interface first sort of, so that we we take the the ready-made parts and just see how we get them work together in the interface the way we want and and then we have a working prototype. And then we can go back and replace pieces either by our own components or by more sophisticated compo po components of our own. So it's probably feasible. The thing is I'm away this weekend. So that's for me Oh yeah, um yeah. No. But also I might like the the similarity thing, like my just my matrix itself for my stuff, I c I I think I can do that fairly quickly because I have the algorithms. Yeah, I think today's meeting is really the one where we where we sort of settle down the data structure and as soon as we have that, uh probably like after today's meeting, we then actually need to well go back first of all and look at NITE X_M_L_ to see in how far that that which we want is compatible with that which NITE X_M_L_ offers us. And then just sort of everyone make sure everyone understand the interface. So I think if today we decide on what data we wanna have now, and and later, maybe even today, we go and look at NITE X_M_L_ or some of us look at NITE X_M_L_ in a bit more detail, just trying to make some sense of that code and see how does the representation work in their system. And then sort of with that knowledge we should be able to then say okay, that type of NITE X_M_L_ data we wanna load into it, and this is how everyone can access it, and then we should be able to go from there. No. 
I've looked looked at the documentation and n like seen enough to make me think that we want to use the NITE X_M_L_ framework because um they have a good a event model that synchronizes sort of the data and and every display element. So that takes a lot of work away from us. Sort of that would be a reason for staying within their framework and using their general classes. But beyond that I haven't looked at it at all, which is something we should really do. Who actually like for this whole discussion I mean, who of us is doing stuff that is happening on-line and who of us is doing stuff that's happening off-line? Like my data is coming c Hmm? Yeah. Okay. Okay. 'Kay. So basically apart from the display module, the i the display itself, we don't have an extremely high degree of interaction between sort of our modules that create the stuff and and the interface, so the interface is mainly while it's running just working on data that's just loaded from a file, I guess. There isn't Yeah, I know. Th Yeah, the search is I guess the search is sort of a strange beast anyway because for the search we're leaving the NITE X_M_L_ framework. Um but that's still sort of that's good. That means that at least like we don't have the type of situation where somebody has to do like a billion calculations on on data on-line. 'Cause that would make it a lot more like that would mean that our interface for the data would have to be a lot more careful about how it performs and and everything. And nobody is modifying that data at at on-line time at all it seems. Nobody's making any changes to the actual data on-line. So that's actually making it a lot easier. That basically means our browser really is a viewer mostly, which isn't doing much with the data except for sort of selecting a piece piece of it and and displaying it. Hmm? Well some parts relevant for the search, yes. I'd say so. Hmm? Yeah, but nobody of us is doing much of searching from the data in the on-line stage. 
And for all together, like the display itself, I think we are easier if we if it's sitting on the X_M_L_ than if it's sitting on the S_Q_L_ stuff, because if it's sitting on the X_M_L_, we have the the NITE X_M_L_ framework with all its functionality for synchronizing through all the different levels whenever there's a change, whenever something's moving forward and stuff. And we can just more or less look at their code, like how their player moves forward, and how that moving forward is represented in different windows and stuff. So I think in the actual browser itself I don't wanna sit on the S_Q_L_ if we can sit on the X_M_L_ because sitting on the X_M_L_ we have all we have so much help. And for y for like the p the calculations that we're doing apart from the search, it seems that everyone needs some special representations anyway. You mean our results? Yeah, in in the NITE X_M_L_ X_M_L_ format, so with their time stamps and stuff, so that it's easy to to tie together st things. What I'm like what we have to think about is if we go with this multi-level idea, like this idea that sort of if you start with a whole meeting series as one entity, as one thing that you display, as one whole sort of, that then the individual chunks of the individual meetings, whereas and then you can click on a meeting, and then sort of the meeting is the whole thing and the chunks are the individual segments, that means sort of we have multiple levels of of representation, which we probably If we if we do it this way like we f we have to discuss that if we do it this way, then we should probably find some abstraction model, so that the interface in the sense like deals with it as if it's same so that the interface doesn't really have to worry whether it's a meeting in the whole meeting series or a segment within a meeting, you know what I mean? And that's probably stuff that we have to sort of like process twice then. 
Like for example that like the summary of a meeting within the whole meeting corpus or meeting series y is meeting series a good word for that? I don't really know what how to call it. You know what I mean, like not not the whole corpus, but every meeting that has to do with one topic. Um so in in the meeting se series so that a summary for a meeting within the meeting series, are sort of compiled off-line by a summary module. And that is separate from a summary of a segment within a meeting. 'Cause I don't think we can So are we doing that at all levels? Are we um And just have different like fine-grainedness levels sort of. Mm. 'Kay. So the only thing that yeah, so the only thing that would happen basically if I double-click let's say from the whole meeting series on a single meeting, is that the zoom level changes. Like the th the start and the end position changes and the zoom level changes. I I thought we couldn't do that. Like I was under the impression that we couldn't do that because we couldn't load the data for all that. But I don't know, I mean that So I'm s not sure if I got it. I was Mm-hmm. Mm-hmm. Mm-hmm. Mm-hmm. Okay. So Okay. I wa I was just worried about the total memory complexity of it. But I I completely admit, I mean, I just sort of like th took that from some thing that Jonathan once said about not loading everything. But maybe I was just wrong about it. How many utterances w Yeah, and I w yeah. Yeah. Yeah. Yeah. So what we have is we would have a word. Like we would have words with some priority levels. And they would basically be because even the selection would would the summaries automatically feed from just how prioritized an individual word or how indiv uh prioritized an individual utterance is? Or i are the summaries sort of refined from it and made by a machine to make sentences and stuff? Or are they just sort of taking out the words with the highest priority and then the words of the second highest priority? And the u okay. 
Are we doing it on th the whole thing on the utterance level? Or are we doing it on word level, like the information density calculation? We I think we have start and end times for words actually, but it's yeah, but it m it might s but it might sound crazy in the player. We should really maybe we can do that together at some point today that we check out how the player works. But there's maybe some merit in altogether doing it on an utterance level in the end. So Yeah. Well but also about the displays, I mean the displays in the in the text body, in the in the latest draft that we had sort of we came up with the idea that it isn't displaying utterance for utterance, but it's also displaying uh a summarised version in you know, like below the below the graph, the part. Maybe Yeah, r Hmm? Oh yeah, f it's just like there there's like audio skimming and there's displayed skimming. Yeah. Ma maybe there's some merit of going altogether for utterance level and not even bother to calculate I mean if you have to do it internally, then you can do it. But maybe like not even store the importance levels for individual words and just sort of rank utterances as a whole. Hmm? Yeah. 'Cause it it might be better skimming and less memory required at the same time. And I mean if you if you know how to do it for individual words, then you can just in the worst case, if you can't find anything else, just sort of make the mean of the words over the utterance. You know what I mean? W it's it's Well what's the smallest chunk at the moment you're thinking of of assigning an importance measure to, is it a word or is it an utterance? So we're thinking of like maybe just storing it on a per utterance level. Because it's it's less stuff to store probably for Dave in the in the audio playing. And for in the display it's probably better if you have whole utterances than I don't know, like what it's like if you just take single words out of utterances. 
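Editorial sketch: the fallback mentioned above, taking the mean of the word scores over the utterance, could look roughly like this in Java. The class and method names are hypothetical and not part of NITE XML; the scores are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of NITE XML: collapses word scores into one utterance score. */
final class UtteranceScoring {
    private UtteranceScoring() {}

    /** Arithmetic mean of the per-word importance scores; 0.0 for an empty utterance. */
    static double meanScore(List<Double> wordScores) {
        if (wordScores.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : wordScores) {
            sum += s;
        }
        return sum / wordScores.size();
    }

    public static void main(String[] args) {
        // Three made-up word scores for one utterance.
        List<Double> words = Arrays.asList(0.2, 0.9, 0.4);
        System.out.println("utterance score = " + meanScore(words));
    }
}
```

With this, only one number per utterance needs to be stored, which is the memory saving the speaker is after. A plain mean is only one possible choice; a maximum or a length-weighted score would fit the same slot.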
That probably doesn't make any sense at all, whereas if you just uh show important utterances but the utterance as a whole it makes more sense. So it doesn't actually make a difference for your algorithm, 'cause it just means that if you're working on a word level, then we just mean it over the utterance. They are on Oh so that's good anyway then, yeah. Because that makes it a lot easier than to t put it on utterance level. Oh yeah. No but I mean like how how Jasmine does it internally I don't know, but it's probably, yeah, you probably have to work on word levels for importance. But there should be ways of easily going from a word level to an utterance level. Okay. Yeah, prob Hmm. Well we do a pre-filtering of sort of the whole thing, sort of like but that, like the problem with that is it's easy to do in the text level. But that would mean it would still play the uh in your audio, unless we sort of also store what pieces we cut out for the audio. Yeah. I think before we can like answer that specific question how we c deal with that, it's probably good for us to look at what the audio player is capable of doing. Yes. So what do you mean by buffering? Like you think directly feeding But yeah, but not but not stored on the hard disk and then loaded in, but loaded in directly from memory. But it's probably a stream if it exists in Java, it would be probably some binary stream going in of some type. Okay, yeah. Okay. Okay, so I mean so that means that there's probably, even if you go on an per utterance level, there's still some merit on within utterances cutting out stuff which clearly isn't relevant at all, and that maybe also for the audio we'd have to do. So let's say we play the whole au phrase, but then in addition to that, we have some information that says minus that part of something. That's okay, that we can do. Yeah, maybe even I mean that's sort of that depends on how how advanced we get. 
If maybe if we realise that there's massive differences in in gain or in something, you can probably just make some simple simple normalization, but that really depends on how much time we have and and how much is necessary. Yeah, if like I d I don't know anything about audio and I have never seen the player. So if you find that the player accepts some n input from memory, and if it's easy to do, then I guess that's that's fairly doable. So but that means in the general structure we're actually quite lucky, so we we have we load into memory for the whole series of meetings just the utterances and rankings for the utterances and some information probably that says, well, the I guess that goes with the utterance, who's speaking. Because then we can also do the display about who's speaking. Yeah. But I'm I'm still confused 'cause I thought like that's just what Jonathan said we do c that we can't do, like load a massive document of that size. On the other hand The other hand, I mean it shouldn't be like should be like fifty mega-byte in RAM or something, it shouldn't be massive, should it? Actually fifty hundred megabyte is quite big in RAM. Just thinking, what's the simp so We do get an error message with the project if we load everything into the project with all the data they load. So we know that doesn't work. So our hope is essentially that we load less into it. What's this lazy loading thing, somebody explain lazy loading to me. Ah, okay. So that is that only by type of file. Like if if if the same thing is in different files, would it then maybe like, you know, if if utterances are split over three or ten or w hundred different files, is then a chance maybe that it doesn't try to load them all into memory at the same time, but just So why does it fail then in the first place? Then it shouldn't ever fail, because then it should never Yeah, but yeah, but um it uh it it failed right when you load it, right, the NITE X_M_L_ kit, so that's interesting. Hmm. 
Let's check that out. Um I'll p I'll probably ask Jonathan about it. So alternatively, if we realise we can't do the whole thing in one go, we can probably just process some sort of meta-data, you know what I mean, like sort of sort of for the whole series chunks representing the individual meetings or some Like something that represents the whole series in in a v in a structure very similar to the structure in which we represent individual um meetings, but with data sort of always combined from the whole series. So instead of having an single utterance that we display, it would probably be like that would be representing a whole um topic, a segment in a meeting. And sort of so that wi using the same data st Well, in a sense Uh I'm I'm thinking of in a sense of like creating a virtual a virtual meeting out of the whole meeting series, sort of. Yeah, sort of like off-line create a virtual meeting, which which basically treats the meeting series as if it was a meeting, and treats the individual meetings within the series as if they were segments, and treats the individual segments within meetings as if they were um utterances. You know, so we just sort of we shift it one level up. And in that way we could probably use the same algorithm and just like make vir like one or two ifs that say okay, if you are on a whole document uh a whole series level and that was a double-click, then don't just go into that um segment, but load a new file or something like it, but in general use the same algorithm. That would be an alternative if we can't actually load the whole thing and 'Cause also like even if we maybe this whole like maybe I'm worrying too much about the whole series in one thing display, because actually I mean probably users wouldn't view that one too often. Yeah, but I'm I'm still worried. 
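Editorial sketch: the "shift it one level up" idea can be pictured as a single node type that stands for a series, a meeting, a topic segment or an utterance, so the interface never needs to know which level it is showing. Everything here (`TimelineNode`, `TimelineView`, the field names) is hypothetical and only illustrates the abstraction, not how NITE XML represents its data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical node: a series, meeting, topic segment or utterance all look the same. */
final class TimelineNode {
    final String label;
    final double start;  // seconds
    final double end;    // seconds
    final double score;  // importance or information density, computed off-line
    private final List<TimelineNode> children = new ArrayList<TimelineNode>();

    TimelineNode(String label, double start, double end, double score) {
        this.label = label;
        this.start = start;
        this.end = end;
        this.score = score;
    }

    /** Adds a child one level down and returns this node for chaining. */
    TimelineNode add(TimelineNode child) {
        children.add(child);
        return this;
    }

    List<TimelineNode> children() {
        return Collections.unmodifiableList(children);
    }

    boolean isLeaf() {
        return children.isEmpty();
    }
}

/** Hypothetical view state: the display only ever shows the children of the current node. */
final class TimelineView {
    private TimelineNode current;

    TimelineView(TimelineNode root) {
        this.current = root;
    }

    TimelineNode current() {
        return current;
    }

    List<TimelineNode> visibleItems() {
        return current.children();
    }

    /** Double-click: zoom into a visible child, unless it is a leaf such as an utterance. */
    boolean zoomInto(TimelineNode child) {
        if (!current.children().contains(child) || child.isLeaf()) {
            return false;
        }
        // This is where the "one or two ifs" would go, e.g. loading another file
        // when the child is a whole meeting that is not in memory yet.
        current = child;
        return true;
    }
}
```

The same display code then works at series level (children are meetings), at meeting level (children are topic segments) and at segment level (children are utterances).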
Like for example for the display, if you actually if you want a display uh like for the whole series, the information density levels based on and and the f and the only granularity you have is individual utterances, that means you have to through every single utterance in a series of seventy hours of meetings. Yeah. Yeah, and if you make that structurally very similar to the the le like one level down, like the way how we uh store individual utterances and stuff, then maybe we can more or less use the same code and just make a few ifs and stuff. Yeah, so so but still so in in general we're having we're having utterances and they have a score. And that's as much as we really need. And of cou and they also have a time a time information of course. Hmm? And a and a s and a speaker information, yeah. Yeah, so an information which topic they're in, yeah. And and probably separate to that an information about the different topics like that Yeah. So so the skimming can work on that because the skimming just sort of sorts the utterances and puts as many in as it needs Yeah. Yeah, it'll it'll play them in some order in which they were set because otherwise it's gonna be more entertaining. Um but that that's enough data for the skimming and the the searching, so what the searching does is the searching leaves the whole framework, goes to the S_Q_L_ database and gets like basically in the end gets just a time marker for where that is, like that utterance that we are concerned with. And then we have to find I'm sure there's some way in in NITE X_M_L_ to just say set position to that time mark. And then it shifts the whole frame and it alerts every single element of the display and the display updates. Yeah, yeah. That we can ju yeah, but so so if if somethi so yeah. 
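Editorial sketch: the data described above (utterances with a score, time information, a speaker and a topic) and the skimming step (sort by score, put in as many as needed, play them in their original order) might be sketched as follows. The class names, the field names and the idea of a time budget in seconds are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical utterance record holding only the fields named in the discussion. */
final class Utterance {
    final String id;
    final String speaker;
    final String topicId;
    final double start;  // seconds
    final double end;    // seconds
    final double score;  // importance, computed off-line

    Utterance(String id, String speaker, String topicId,
              double start, double end, double score) {
        this.id = id;
        this.speaker = speaker;
        this.topicId = topicId;
        this.start = start;
        this.end = end;
        this.score = score;
    }

    double duration() {
        return end - start;
    }
}

/** Hypothetical skimmer: greedy selection by score under a time budget. */
final class Skimmer {
    private Skimmer() {}

    static List<Utterance> skim(List<Utterance> all, double budgetSeconds) {
        // Highest-scoring utterances first.
        List<Utterance> byScore = new ArrayList<Utterance>(all);
        byScore.sort(Comparator.comparingDouble((Utterance u) -> u.score).reversed());

        // Take every utterance that still fits into the remaining budget.
        List<Utterance> chosen = new ArrayList<Utterance>();
        double used = 0.0;
        for (Utterance u : byScore) {
            if (used + u.duration() <= budgetSeconds) {
                chosen.add(u);
                used += u.duration();
            }
        }

        // Play back in the order in which things were said.
        chosen.sort(Comparator.comparingDouble((Utterance u) -> u.start));
        return chosen;
    }
}
```

The greedy loop skips an utterance that is too long for the remaining budget and keeps trying shorter, lower-ranked ones. Whether that is the right policy is an open design choice, not something the discussion settles.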
So if in that tree display somebody clicks on something Yeah, and then you sort of feed the time stamp to and the NITE X_M_L_ central manager, and that central manager alerts everything that's there, like alerts the skim like the the audio display, alerts the text display, alerts the visual display and says we have a new time frame and then they all sort of do their update routines with respect to the current level of zoom. So how much do they display, and starting position at where the or maybe the mid-position of it, I don't know, like w if start where the thing was found or if that thing wa was found it's in the middle of the part that we display, that I don't know. But that we can decide about, but a general sort of It's the same thing if like whether you play and it moves forward or whether you jump to a position through search, it's essentially for all the window handling, it's the same event. It's only that the event gets triggered by the search routine which sort of push that into NITE X_M_L_ and says please go there now. Why do we have to do it in memory? But that stuff's so I mean like the information is coming from off-line. So we probably we don't even have to change the utterance document, right, because the whole way, like the whole beauty of the NITE X_M_L_ is that it ties together lots of different files. So we can just create an additional X_M_L_ file which for every utterance like the utterances have I_D_s I presume, some references. So we just we tie uh p just a very short X_M_L_ file, which it's the only information it has that has whatever a number for for the um weight, for the information density, and we just tie that to the existing utterances and tie them to the existing speaker changes. Well otherwise we probably have to go over it and like add some integer that we just increment from top to bottom sort of to every utterance as an as an I_D_ some type. 
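Editorial sketch: the central-manager behaviour described above is essentially the observer pattern. Playing forward and jumping via search both end in the same "new time" event that every display reacts to. The types below are hypothetical stand-ins, not the NITE XML classes, whose actual API the group had not yet examined. Centring the window on the new position is an assumption; the speaker leaves open whether the found position should be at the start or in the middle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical listener interface; the real NITE XML event model may look different. */
interface TimeListener {
    void timeChanged(double seconds);
}

/** Hypothetical stand-in for the central manager that alerts every display element. */
final class TimeManager {
    private final List<TimeListener> listeners = new ArrayList<TimeListener>();
    private double current;

    void addListener(TimeListener listener) {
        listeners.add(listener);
    }

    /** Called by the player while playing, or by the search routine when jumping. */
    void setTime(double seconds) {
        current = seconds;
        for (TimeListener listener : listeners) {
            listener.timeChanged(seconds);
        }
    }

    double currentTime() {
        return current;
    }
}

/** Toy display that records the time window it would draw at its own zoom level. */
final class WindowedDisplay implements TimeListener {
    final String name;
    private final double zoomSeconds;  // how many seconds are visible at once
    double windowStart;
    double windowEnd;

    WindowedDisplay(String name, double zoomSeconds) {
        this.name = name;
        this.zoomSeconds = zoomSeconds;
    }

    @Override
    public void timeChanged(double seconds) {
        // Assumption: centre the window on the new position, clamped at time 0.
        windowStart = Math.max(0.0, seconds - zoomSeconds / 2);
        windowEnd = windowStart + zoomSeconds;
    }
}
```

A search hit would then only need to call `setTime` with the time marker that came back from the database, and the text, audio and visual displays would each run their own update routine.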
Or un or try to understand how NITE X_M_L_ I_D_s work and maybe there's some special route we have to follow when we use these I_D_s. It's alm hmm? Yeah, the the girl said the utterances themselves are not numbered at the moment. Okay. Okay. Okay. Yeah. So I guess that would be solvable if not. Mm-hmm. Sorry? Okay. Okay. Is that a board marker pen actually? Oh. That's just so like to make a list of all this stuff, or we probably can somebody can do it on paper. All these fancy pens. So what so the stuff we have we have utterances and speakers and weights for utterances. So for for every utterance sort of like the utterance has a speaker and a weight which is coming from outside. Or we just tie it to it. And there is segments, which hmm? Oh, so sorry um. Uh topic s topic segments I meant. Like they are they are a super-unit. So so the utterances are tied to topic segments. And if the time stamps are on a word level, then we b somehow have to extract time stamps for utterances where they start. W what segments now? Okay. Is the uh is that the same as utterances that is that the same as utterances that Mm-hmm. Mm-hmm. What so that's Oh. But that's one o one segment or is that two segments then? Yeah. Okay. Okay. So but but generally utterances is that which we just called uh sorry, segments is that which we just called utterances now. Like it's it's the sa it's sort of like one person's contribution at a time sort of thingy dingy. Okay, so yeah, so we have those, and and then we have some f field somewhere else which has topics. Yeah, and and a topic's basically they are just on the I_D_, probably with a start time or something, and and the utterances referenced to those topics I guess. So the topics don't contain any redundant thing of like showing the whole topic again, but they just sort of say a number and where they start and where they finish. And the utterances then say which topic they belong to. Yeah. No. 
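Editorial sketch: the "very short X_M_L_ file" that only carries a weight per utterance ID could be read with the standard Java DOM parser and kept as a map from ID to weight. The element and attribute names (`weights`, `weight`, `ref`, `value`) are invented for this example and are not the NITE XML format, which would need to be checked against its documentation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for a stand-off weight file; the XML layout is an assumption. */
final class StandoffWeights {
    private StandoffWeights() {}

    /** Parses <weight ref="..." value="..."/> elements into a map from utterance ID to weight. */
    static Map<String, Double> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("weight");
            Map<String, Double> weights = new HashMap<String, Double>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                weights.put(e.getAttribute("ref"), Double.parseDouble(e.getAttribute("value")));
            }
            return weights;
        } catch (Exception ex) {
            // Wrap parser and I/O errors so callers do not need checked exceptions.
            throw new IllegalArgumentException("could not read weight file", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<weights>"
                + "<weight ref=\"utt1\" value=\"0.8\"/>"
                + "<weight ref=\"utt2\" value=\"0.1\"/>"
                + "</weights>";
        System.out.println(parse(xml));
    }
}
```

Because the weights live in their own file and only point at utterance IDs, the utterance document itself never has to change, which is the point made in the discussion. The same shape would work for topic segments: an ID, a start, an end, and utterances that refer to the topic ID.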
But I was thinking of the topic segmentation now and and f for that there would only be one, right, because it's sort of like it's just a time window. Yeah. So if this lazy loading works, then this should definitely fit into I mean not memory then because it wouldn't all be in memory at the same time. So if we just have those sort of that information like a long list of all the utterances slash segments and like short or smaller lists which give weight to them. And even though probably if there's a lot of over-head in having two different files, we can probably merge the weights into it off-line. You know what I mean, like if if there's a lot of bureaucracy involved with having two different trees and whether one ties to the other because the one has the weight for the other, then it's probably quicker to just Yeah, I thought that was the whole beauty that like you can just make a new X_M_L_ file and sort of tie that to the other and and it tre Oh yeah. So no, I didn't I didn't mean tree. No. No. I meant just like handling two different files internally. Sort of c I was just thinking you know like if if the overhead for having the same amount of data coming from two d files instead of from one file is massive then it would probably be for us easy to just like off-line put the the weight into into the file that has the segments, uh yeah, segments slash utterances already. But that we can figure out I mean if it's going horrendously wrong. Yeah. Yeah. Yeah, no, we'd we'd be completely using like the whole infrastructure and basically just I mean the main difference really between our project and theirs really is that we load a different part of the data. But otherwise we're doing it the same way that they are doing it. So we just we're sort of running different types of queries on it. 
We in a sense we I think we are running queries, it's not just about um what we load and what we don't load, but we're l running queries in the sense that we dynamically select by by weights, don't we? That we have to check how fast that is, like to say give us all the ones that whether that works with their query language, whether that's too many results and whether we shou You know, if 'cause if it i let's say I mean if if their query language is strange and if it would return b ten million results and it can't handle it, then we can just write our individual components in the way that they know which what the threshold is. So they still get all the data and just they internally say oh no, this is less than three and I'm not gonna display it or something. Hmm? Yeah. No. I'm just thinking for this whole thing of like a different level, sort of cutting out different different pieces, whether we do that through a query where we say give us everything that's ab above this and this weight, or whether we skip the same infrastructure, but every individual module like the player and the display say like they still get sort of all the different utterances, uh all the different pieces, but they say oh, this piece I leave out, because it's below the current threshold level. When do we need the one for the meet Okay. Yeah, I guess for the so when we have the display, will we display the whole series. Then if we have for the individual topic segments within the meetings if we have ready calculated disp um measures, then we don't have to sort of extract that data from the individual utterances. Yeah, and that's also fairly easy to store along with our segments, isn't it. For the segments, are we extracting some type of title for them that we craft with some fancy algorithm or manually or we're just taking the single most highly valued key-word utterance for the segment heading? Hmm. Hmm. 
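Editorial sketch: the second option discussed above, where every module still receives all pieces and decides for itself what falls below the current threshold, might look like this. `WeightedPiece` and `ThresholdFilter` are hypothetical names; the alternative (asking the query language for everything above a given weight) is not shown because the query syntax was still unknown to the group.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical piece of data (utterance or segment) with an off-line weight. */
final class WeightedPiece {
    final String id;
    final double weight;

    WeightedPiece(String id, double weight) {
        this.id = id;
        this.weight = weight;
    }
}

/** Hypothetical filter a player or display module could hold on to. */
final class ThresholdFilter {
    private double threshold;

    ThresholdFilter(double threshold) {
        this.threshold = threshold;
    }

    /** Changing the level of detail only changes this number, not the loaded data. */
    void setThreshold(double threshold) {
        this.threshold = threshold;
    }

    /** Returns the pieces at or above the current threshold, in their original order. */
    List<WeightedPiece> keep(List<WeightedPiece> all) {
        List<WeightedPiece> kept = new ArrayList<WeightedPiece>();
        for (WeightedPiece p : all) {
            if (p.weight >= threshold) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

The trade-off is the one raised in the discussion: filtering inside each module avoids depending on how well the query language copes with large result sets, at the cost of every module seeing all the data.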
It's probably like in in the end probably it wouldn't be the best thing if it's just the high most highly ranked phrase or key-word because like for example for an introduction that would most definitely not be anything that has any title anywhere similar to introduction or something. Yeah. Also like for this part, maybe if we go over it with named entity in the end, if I mean w if one of the people doing DIL has some named entity code to spare, and just like at least for the for sort of for finding topics, titles for for segments, just take a named entity which has a really high, what's it called, D_F_I_D_F_, whatever. 'Cause you'd probably be quite likely if they're talking about a conference or a person, that that would be a named entity which is very highly fr um frequented in that part. Yeah, he said they're quite sparse. So that basically was don't bother basing too much of your general calculation on it. But like especially if they're sparse, probably individual named entities which describe what a what a segment is about would probably be quite good. Like if there's some name of some conference, they would could probably say that name of the conference quite often, even though he's right that they make indirect references to it. Anyway Sorry? So you're doing that on a on a per word level. Okay. Okay. Okay, cool. I was just wondering where you had the corpus from at the moment. So it it seems that the data structure isn't a big problem and that basically we don't have to have all these massive discussions of how we exactly interact with the data structure because most of our work isn't done with that data structure in memory in the browser, but it's just done off-line and everyone can ha represent it anyway they want as long as they sort of store it in a useful X_M_L_ representation in the end. So like Yeah, that would mean understanding the NITE X_M_L_ X_M_L_ sort of format in a lot more detail. 
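Editorial sketch: the measure the speaker is reaching for when suggesting segment titles is presumably TF-IDF. A minimal version over topic segments, picking the highest-scoring term of one segment as a title candidate, is shown below. It uses the common tf times log(N/df) weighting, which is an assumption about what was meant. It also ranks all tokens, whereas the suggestion in the discussion is to restrict candidates to named entities; that filtering step is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical title picker: best TF-IDF term of one segment among all segments. */
final class TitlePicker {
    private TitlePicker() {}

    /** Returns the term with the highest tf * log(N / df) in segments.get(index), or null if empty. */
    static String bestTerm(List<List<String>> segments, int index) {
        // Document frequency: in how many segments does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> segment : segments) {
            for (String term : new HashSet<String>(segment)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        // Term frequency inside the segment we want a title for.
        Map<String, Integer> tf = new HashMap<>();
        for (String term : segments.get(index)) {
            tf.merge(term, 1, Integer::sum);
        }
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) segments.size() / df.get(e.getKey()));
            double score = e.getValue() * idf;
            // Ties are broken alphabetically so the result is deterministic.
            if (score > bestScore || (score == bestScore && e.getKey().compareTo(best) < 0)) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("the", "conference", "the", "conference", "deadline"),
                Arrays.asList("the", "budget", "the"),
                Arrays.asList("the", "budget", "plan"));
        System.out.println(bestTerm(segments, 0));
    }
}
```

A word that occurs in every segment gets an IDF of zero, so very common words drop out on their own. That does not solve the problem raised above that an introduction segment is unlikely to contain a word like "introduction".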
We should I think we should just have a long session in the computer room together and like now that we know a bit more what we want, take a closer look at NITE X_M_L_. Mm-hmm. Mm-hmm. Good. Yeah, I haven't looked at this stuff much at all. Yeah. Yeah. Who's who's sort of doing the the the central coordination of of of the browser application now? Like Hmm? Yeah, or but also like all these elements like like the loading and, yeah, integration and and like handling the data loading and stuff. Nah. I'm sort of like I think I'll take over the display, just because I've started with a bit and found it found it doable. So somebody should sort of be the one person who's who understands most about what's t centrally going on with with the with the project, like with the with the browser as a whole and where the data comes in and Any volunteers? It's also a complicated one. Yeah. I know but uh b I guess we can do it like several people together, it's probably just those people have to work together a lot and very closely and just make sure that they're always f understand what the other one is doing. Yeah, or or ready-made versions of them for that matter and Yeah, but I think actually like at the moment the integration comes first, I mean it's sort of at the moment the building the browser comes first, and then only comes the creating new sophisticated data chunks, because that's sort of the whole thing about having a prototype system which is more or less working on on chunk data. But it at least we have the framework in which we can then test everything and and look at everything. 'Cause before we have that, it's gonna be very difficult for anyone to really see how much the work that they're doing is making sense because you just well I guess you can see something from the data that you have in your individual X_M_L_ s files files that you create, but it would be nice to have some basic system which just displays some stuff. 
Or just adapt like their like just sort of go from their system and and adapt that piece for piece and see how we could how we could arran like adapt it to our system. Does anyone want to like just sit with me and like play for three hours with NITE X_M_L_ at some point? Uh I wouldn't like to be 'cause I'd like to go to the gym. I'm theoretically free. But if there's any time t hmm? You have nothing no free time on Wednesday. Hmm. Nine 'til twelve and then nothi you have or you Hmm? Anytime Wednesday afternoon I'd be cool, I think. Yo, Forrest Hill, whatever one's easier to discuss stuff, I don't know. I'm not biased. Okay. What time do you wanna do? Okay, so I'll just meet you in in eighteen a in the afternoon. I guess at the moment nobody critically depends on like the NITE X_M_L_ stuff working right now, right? Like at the moment you can all do your stuff and I can do my L_S_A_ stuff. And I can even do the display to a vast degree without actually having their supplying framework working. So it's not that crucial. Yeah, actually I need the raw text as well. Yeah, but I was I was I was more thinking of the sort of the the whole browser framework as a running programme now. Yeah, I think we all need the raw text in different in different flavours, don't we? But number within the X_M_L_ context. Are they spoken numbers? Like do they look like they're utterances numbers? There's the number task, isn't there. That's part of the whole thing. Hmm? Okay. Hmm. Yeah, we have to probably cut that out anyway for our project, I don't know. It's probably gonna screw up a lot of our data otherwise. If Not sure if it what it does to document It would probably make the yeah, if if you have segments for that, probably the Okay. Uh I'm just thinking like it pro it pro probably like the L_S_A_ would perform quite well on it. It would probably find another number task quite easily seeing that it's a constrained vocabulary with a high co-occurrence of the same nine words. 
So that wou ten word. Hmm? Yeah. I think it's also something that they they said the numbers in order, right? Yeah, I think it's it the it sounded like they wanted to check out how well they were doing with overlapping and stuff, because basically it's like they're reading them at different speeds, but you know in which order they are said. Anyway. ICSI has some reasons for doing it. They must have been pissed off saying like numbers at the end of every meeting. Um Dave, if you would or actually for well, if you're doing I_D_F_s or you whatever you call your your frequencies, I always mix up the name, uh you need some dictionary for that at some point though, like you need to have some representation of a word as not not that specific occurrence of that word token, but of of of a given word form. Because you're making counts for word forms, right? Yeah, so we should work together on that, because I need a dictionary as well. Okay. 'Kay. Okay. Didn't you say that the o the ord Yeah, but for I'm just wondering for the whole thing. Does somebody wo who was it of you two who said that um there's some programme which spits out a dictionary probably with frequencies? Okay. Is anyone of you for the for the document frequency over total frequency, you gonna have total frequencies of words then with that, right? Like over the whole corpus sort of. Or W using which tool are you talking about? Be careful with that. Like my experience with the British National Corpus was that there's far more word types than you ever think because anything that's sort of unusual generally is a new word type. Like any typo or any strange thing where they put two words together. And also any number as a word type of its own. So you can easily end up with hundred thousands of words when you didn't expect them. So generally dictionaries can grow bigger than you think they do. 
Well you can probably also you can probably pre-filter like with regular expressions even just say if it consists of only dig digits, then skip it, or even if it consists of any special characters, then skip it because it's probably something with a dot in between, which is usually not something you wanna have and What I did, for my project I just ignored the hundred most frequent words, because they actually end up all being articles and and everything and stuff. So we need like several of us need a dictionary. Am I the only one who needs it with frequencies? Or Frequencies. Yeah. Well I guess as soon as we have the raw text, we can probably just start with the Java hash map and like just hash map over it and see how far we get. I mean we can probably on a machine with a few hundred megabyte RAM you can go quite far. You can write it on beefy. So even if it goes wrong and even if it has a million words be Oh yeah, burning it on a like we should be able to burn the whole corpus, just the X_ hmm? Ah I see, I asked support about that two days ago. In the Informatics building there oh sorry, in in Appleton Tower five the ones closest t two machines closest to the support office. So I presume oh wait, I have the exact email. I think he's talking about sort of the ones that Yeah, if you if you enter the big room, in the right-hand corner, I think. Um the thing is like you can only burn from the local file-system. So if it's from s well actually I think if it's mounted, you can directly burn from there, but the problem is I have my data on beefy and so I have to get it into the local temp directory and burn it from there. But you can burn it from there. Uh we looked that up and I for we looked that up and I forgot. Yeah yeah. No, you you we should be able to get it at I don't think it was I don't think it was a gigabyte. Hmm. See I would off I would offer you to to get it on this one, and then um like copy it.
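The pre-filtering idea mentioned here (skip digit-only tokens, skip tokens with special characters, ignore the most frequent words) could be sketched roughly as follows. This is a Python stand-in rather than the Java the speaker has in mind, and the pattern `WORD_RE` and the function names are hypothetical choices, not anything fixed in the meeting.

```python
import re
from collections import Counter

# Hypothetical pattern: letters only, optionally joined by an internal
# apostrophe or hyphen ("don't", "co-occurrence"). Digits and dots fail it.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def keep_token(token):
    """Return True for plain word forms, False for numbers or tokens with special characters."""
    return WORD_RE.fullmatch(token) is not None


def filtered_counts(tokens, drop_top=100):
    """Count the kept tokens, then remove the `drop_top` most frequent word forms."""
    counts = Counter(t.lower() for t in tokens if keep_token(t))
    for word, _ in counts.most_common(drop_top):
        del counts[word]
    return counts
```

Dropping the top hundred word forms is a crude substitute for a stop-word list; the cut-off is an assumption and would need checking against the actual corpus.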
But you know what I figured out, I'm quicker down-loading over broad-band into my computer than using this hard disk. There's something strange about the way how they access the hard disk, how they mount it, which is unfortunate. Hmm. What operating system do you have? Okay. Wh what connection do you have at home? Yeah. So if anyone of us gets it, we can then just use an ext hmm? Yeah, burn it to C_D_ or, yeah, put it on on hard disk, whatever. Question is if you're not quicker if you uh because you should get massive compression out of that. Like fifty percent or something with a good algorithm. So if you could compress it and just put it into a temp directory. Like The temp the temps usually have for gigabyte three or two. The temps, yeah. I do like I mean there's no guarantee that anything stays there, but overnight it'll stay. And I think the temps usually have. Ah yeah, but that would have to be the temp directory off the machine you can S_S_H_ into directory of S_S_H_. Yeah, they wou they'd they'd probably hate you for doing it. But They'd probably they'd like you more if you S_S_H_ uh into another computer, compress it there and then sort of copy it into the into the gateway machine. They have um if you S_S_ hey, you know, if you if you S_S_H_ and they have this big warning about doing nothing at all in the gateway machine. Yeah. To your home machine. I haven't I haven't figured out how to tunnel through the gateway into another machine yet. It's not it's not easy definitely. That's why I end up sort of copying stuff into the temp directory at the gateway machine. Sorry if this is boring everybody else. This is just details and how to get stuff home from what we can probably just look at that together when we're meeting. I'm sorry. Mm-hmm. Well yeah.
As soon as somebody gives me the raw text of the whole thing, I can probably just implement like a five line Java hash table frequency dictionary builder and see Oh, did you not say frequencies f of words in the whole sorry, did uh So you'd you Yeah, you'd have to count it yourself, yeah. Oh, you don't wanna have different counts for each chunk, but just like sort of for for something from old chunks. Oh yeah, no, that's yeah, so once I write an ar like w if I write like an algorithm which does a hash um table dictionary with frequency from a raw text, then the raw text can be anything. So how far are we g uh how f how far are you getting raw text out of it do you think? Okay, well that's good, because for the dictionary the order doesn't make a difference, does it? So yeah, so um I'll get that from you and I'll write the hash table which goes over that and creates a dictionary file. So for the dictionary, is it okay if I do, whatever, word blank frequency or something? Just p could everybody sort of start from that? I mean I guess we can Yeah, I I need frequency as well. Well I think we might have a lot in common what we calculate because I for my latent semantic analysis need like counts of words within a document, uh within a a segment actually, within a topic segment. Can I convert these probabilities back into frequencies? Okay. Oh, so that's what f Rainbow does, because that's what L_S_A_ builds on. Like it builds a f a document by frequency matrix. So I could probably get that. Even though but I already have I already have my code to build it up myself. No, don't bother. I have my code already. Um Yeah, so Dave, you said you need the frequency counts actually for per document, would you say, not for the whole thing? 
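The document-by-frequency matrix that L_S_A_ builds on, with topic segments playing the role of documents, might look like the sketch below. It is written in Python for brevity, the function name `count_matrix` is hypothetical, and it assumes each segment has already been tokenised into a list of words.

```python
from collections import Counter


def count_matrix(segments):
    """Build a segments-by-vocabulary matrix of raw word counts.

    `segments` is a list of token lists, one per topic segment (assumed input).
    Returns (vocabulary, rows), where rows[i][j] is how often vocabulary[j]
    occurs in segment i.
    """
    per_segment = [Counter(tokens) for tokens in segments]
    vocabulary = sorted(set().union(*per_segment))
    rows = [[counts[word] for word in vocabulary] for counts in per_segment]
    return vocabulary, rows
```

For example, `count_matrix([["one", "two", "one"], ["two", "three"]])` gives the vocabulary `["one", "three", "two"]` and the rows `[[2, 0, 1], [0, 1, 1]]`. A real corpus would want a sparse representation rather than nested lists.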
It more and more appears to me that if we if we scrap the notion of the meeting as an individual thing and sort of ju see meetings as as topic segments and have sort of like hierarchical topic segmentation instead, then it's b like a more coherent framework. Wait, are we are we using this um for the for the for the do for the weighting in the end now, this this measure you're calculating? Because if we're doing Like I think for for the information density we uh we should calculate it on the lowest level, not on the highest. But like 'cause Yeah, but w it don't you have to like go sort of like for in a document versus the whole thing? Isn't that how it works that you c look look at r I don't think that's a good idea because isn't it like that we expect th there to change over i b with the different topic segments more? That they talk about something different in each different topic segment. 'Cause that's what relative term frequency is about, that like in some context they're talking more about a certain word than in general. So that would more be the the topic segments then. I don't know. Yeah. Yeah. Yeah. So I'm just wondering if there's ways to abandon the whole concept of of meetings and sort of but just not really treating separate meetings as too much of a separate entity. But But on algorithmic level, whether we actually whether there's some way to just represent meetings as as topics. Hmm. That's not really what I meant. But I think I have to think more about what I meant. Um g I'm confused about everything. Yeah. 
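One possible reading of the relative term frequency discussed here is the share a word has within one topic segment divided by its share in the whole corpus. The sketch below, in Python, follows that reading only; the measure the group actually settles on is not specified in the meeting, and the function name is hypothetical.

```python
from collections import Counter


def relative_frequency(segment_tokens, corpus_counts, corpus_total):
    """Ratio of each word's share in a segment to its share in the corpus.

    Values above 1 mean the word is talked about more in this segment than
    in general. Assumes the corpus counts include the segment itself, so
    every segment word has a non-zero corpus count.
    """
    segment_counts = Counter(segment_tokens)
    segment_total = len(segment_tokens)
    scores = {}
    for word, n in segment_counts.items():
        corpus_share = corpus_counts[word] / corpus_total
        scores[word] = (n / segment_total) / corpus_share
    return scores
```

Computing this per topic segment rather than per meeting matches the argument that vocabulary shifts more between segments than between whole meetings.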
I'm I'm not so concerned about the m a meeting plus something else, I'm more talking about like, yeah, the keeping keeping the same algorithm and the same way of handling it and just saying like just this this topic here i uh it happens to be like a whole meeting and it has sort of sub-topics, so just that sort of topics a hierarchical concept where like a topic where there can be super-topics and topics, and the super-topics are in the end what the meetings are, but in general at some level super-topics are treated like like topics. Hmm. Mm I'm not really sure what I want. So sorry, could describe that again, the Mm-hmm. Mm-hmm. Mm-hmm. So that would be the series as a whole. That would be sort of m meetings, yeah. Yeah. I'm a I'm a I'm a bit brain-damaged at the moment, but I think I'll just sit together with you again and and go through it again. Hmm. So so I'll is th it like is this and this structurally then always identical? So that we can that we can treat it with the same algorithm or Yeah, I'm also not sure how we can go from from bottom-up. I have always thought it's like more that oh, whatever, I'm a can't think of it at the moment. Probably this is all too complicated worrying about that at that moment anyway. Now have have we have we decided anything, are we doing anything? S Wednesday we are meeting and looking at their at their implementation in some more detail to actually understand what's going on. We had two things from their stuff just to make sure that we are like understand it, we understand it enough to to m modify it. Yep. How would we do that? By just making like it w read write for everyone. 'Kay, who has most free space on their Same here. Well we alternatively we can probably just make another directory on the beefy scratch space. I mean that's where I'm having gigabytes and gigabytes of stuff at the moment. No. No. Yeah. But I think if he sends to the I think if he sends to the port he'd probably be in a better position. Yeah. Hmm. 
I think he said yes to that. I think uh that was like in when we were still in the seminar room, I asked that once or like ask is it possible to get it off and nobody said like people were discussing about the technical practicalities, but nobody said anything about al being allowed to or not allowed to. I mean, we have access to it here and I guess it probably means that we we can't give it to anybody else. But but if they give us access to it here o sitting on a DICE machine, then there shouldn't be a reason why we shouldn't be able to use it on our laptop. I personally don't have too many friends who would be too keen on getting it anyway. I have that really excited pirate copied thing. It annotated meeting data. Huh. Wait, wait, wait. Um sorry. Yeah, sorry. What I just realised, we should really t keep different series completely separate for virtually all purposes. Just let's be careful about that, because like the the ICSI corpus isn't isn't one meeting series, it's several meeting series with different people meeting for completely different things. For each meeting. Alright. Okay, but like let's just be careful that whatever we sort of we merge together, that like the highest level of merging, it's not the whole ICSI corpus but individual series. I think we might actually I think That's probably be somewhere like well or something like it. Um I think we might just get away with for the whole project just like looking at only one series and just doing within one series. I mean you can do everything you want in one series. Oh yeah, let's take that. Is the is the data always clearly split up by different series? Uh like is it easy to just pick one Okay. Okay. Okay. Okay. So at at every level everyone has to be careful to really just take even at the highest level, just take stuff from one series and not merge stuff from different series together because they would probably be just majorly messy.
Yeah, so so t so like if even if we make one single text file which has the whole corpus, sort of our corpus, that would still be from one series only. Wou but it what you're producing at the moment is like individual text files that sort of have the raw text for a whole a meeting as a whole or Mm-hmm. Yeah. 'Kay. Um so is is anybody creating an uh a real raw text thing at the moment, like which is just the words? Yeah, tha 'cause that's what I'm gonna need as well. But i but if there uh b aren't like so it's it's start and end times just for the file. Like is it just the first and the last line? Or is it for every single thing in So what do you mean by just not print out that? Okay. If you're into it, can you make a text file which just like makes just the words? 'Kay. Do you want it straight flowing, 'cause I would need something that marks the end of uh of uh is is yours segmented by topics then that like is there any information that you have to the topic, to the automated topic topic segmentation? Oh then I need something different later anyway. Okay, but for now, if you c Okay. You're gonna put that as an output of yours, the segmentation. Okay, so for now can you create like sort of just uh a dump which is pure text, just pure text so that I can get a dictionary and you can work on that for your topic segmentation. And Or for for the series. But I can but I can also deal with separate files, I mean I can just write the algorithm that it loads all files in a directory or something. But I mean if you But if you can put it in one single mega-file, that would be quite useful for me. Even though for you, wouldn't it be easier if you had different files because then you sort of know like Yeah. So give m give me different files as long as like it m if you could name them in a way that is easy to enumerate over them, like whatever, one two three four five or something. Or just anything that I can Yeah. Is is it something that's easily enu like to enumerate over? 
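Loading all files in a directory, with names that are easy to enumerate, could be sketched like this in Python. The naming scheme `1.txt`, `2.txt`, and so on is an assumption taken from the "one two three four five" suggestion, and `read_series` is a hypothetical helper name.

```python
from pathlib import Path


def read_series(directory, pattern="*.txt"):
    """Concatenate all matching files in numeric filename order.

    Numbered files (1.txt, 2.txt, ... 10.txt) come first in numeric order,
    so 10.txt sorts after 2.txt; any other names follow alphabetically.
    """
    def order(path):
        stem = path.stem
        return (0, int(stem), "") if stem.isdecimal() else (1, 0, stem)

    files = sorted(Path(directory).glob(pattern), key=order)
    return "\n".join(f.read_text(encoding="utf-8") for f in files)
```

Keeping one directory per meeting series would make it harder to merge material from different series by accident.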
Is it some just some ordered pattern? Okay, cool. Okay, cool. Yeah. In the right order. It's just a wish list. Orders. When do you think you'll have um like a primitive segmentation by some ready-made topic segmentation by some ready-made tool ready? Okay. Okay, cool. 'Cause I'll need that then when it's done. Okay. Mm-hmm. What's what's nine megabyte? The the That sounds quite reasonable. That's nine nine characters over okay. Okay. Okay. That is for are we are we picking one particular series at the moment? Or Yes. Okay. Yeah. Yeah, I guess we can probably process the data for all different series and then check which series is the best for the presentation. It sounds quite reasonable, nine megabyte. I mean if you think if it's r roughly a million words and nine characters per word sounds realisti Yeah. Yes, I'm gonna build a dictionary then from that. Like just a list of the words that maybe a list of the words with the frequencies or a list of the words sorted alphabetically or numerically. What what does anyone want? Does this there any wishes for dictionaries? So I'll create a dictionary. Add add the structure, yeah. And then the actual file we can probably like copy from your home directory or something like it. Yeah yeah, but I'm sa I'm saying for the whole thing in the end. Then like the big thing we probably shouldn't do by email. Yeah. Oh, from the time I get the file I can do that in an afternoon, the next sort of the next morning. Oh, you mean how long processing time it takes. Ah, it's a it's a bog standard algorithm. I've I've sort of I've written it for for DIL just in half an hour or something similar. It's just you put them in a hash table and and say well if it exists already in the hash table then you increase the count by one and I'll probably implement some filter for filtering out numbers or something. Really? How do you do that? Okay, well I don't know any Perl. 
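The bog standard algorithm described here (put each word in a hash table, add one to its count if it is already there, filter out numbers) is sketched below in Python instead of Java. Lower-casing the tokens and splitting on whitespace are assumptions, and the function names are hypothetical.

```python
def build_dictionary(text):
    """Count each whitespace-separated word form in `text`.

    A word already in the table has its count increased by one.
    Digit-only tokens are skipped as a simple number filter.
    """
    counts = {}
    for token in text.split():
        word = token.lower()  # assumption: case is not distinguished
        if word.isdigit():
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


def dictionary_lines(counts):
    """Return one 'word frequency' line per entry, sorted alphabetically."""
    return [f"{word} {n}" for word, n in sorted(counts.items())]
```

For instance, `dictionary_lines(build_dictionary("the topic the 42 Topic"))` yields `["the 2", "topic 2"]`, which is the "word blank frequency" layout proposed earlier in the meeting.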
I mean if anyone wants to do a Perl script for that that does it does it nicely, I uh I've no problem with that. I but I think I have the Java code virtually ready because for DIL I wrote something very similar. Like for DIL I wrote something that counts the the different occurrences of all the tags um Sorry? The hash table? Uh I've never serialized anything. Wouldn't that be absolutely massive though? And then seriali and then write the serialization to a file. So you want like a se like a file which is the serialization of a hash table. Okay. Yeah. I I'll I'll check if I understand how it works. I mean otherwise I can give you the code for loading a dictionary. Give you my my it's just it's it's sort of it's a line break separated file, you know. Yeah. Yeah, I'll see if I understand how to serialize. There's a there's a serialise command so that gives me one mega mother of a s Yeah, but do they automatically write to the file anyway I'll I'll figure that out. We don't have to Yes, is that pretty much pretty much it? So Dave and me look at how NITE X_M_L_ works and we're Hmm. I'll build a dictionary as soon as I get the text. And yeah, so that When do we have to meet again then with this? How are we gonna do a demonstrator next week? My God. No no, not demonstrate, but like didn't you say that uh didn't we sort of agree that it would be useful to have a demonstrator of it, like some primitive thing working next week. That's gotta be very prototype. Mm-hmm. Ah well, let's go. Sorry. I feel like like hanging mid-air and not really like finding a point where you can get your teeth into it and start working properly and so it's all so fuzzy the whole Yeah, but it at the moment but at the moment it's also an implementational level. Like with the data structures, I'm just like over these vague ideas of some trees, I'm f Yeah. It's just we are half-way through the project time table. That's just what freaks me out. Um
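As an alternative to serializing the hash table, the line break separated dictionary file can be written and loaded back with a few lines. The Python sketch below assumes one `word frequency` pair per line and uses hypothetical function names; it is not the loading code the speaker offers to share.

```python
def save_dictionary(counts, path):
    """Write one 'word frequency' pair per line, sorted alphabetically."""
    with open(path, "w", encoding="utf-8") as out:
        for word, n in sorted(counts.items()):
            out.write(f"{word} {n}\n")


def load_dictionary(path):
    """Read a file written by save_dictionary back into a dict."""
    counts = {}
    with open(path, encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue
            # Split on the last blank so the count is always the final field.
            word, n = line.rsplit(" ", 1)
            counts[word] = int(n)
    return counts
```

A plain text file like this can be read from Java or Perl just as easily, which a language-specific serialization format would not allow.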